Semantically Guided Depth Upsampling
We present a novel method for accurate and efficient upsampling of sparse
depth data, guided by high-resolution imagery. Our approach goes beyond the use
of intensity cues only and additionally exploits object boundary cues through
structured edge detection and semantic scene labeling for guidance. Both cues
are combined within a geodesic distance measure that allows for
boundary-preserving depth interpolation while utilizing local context. We
model the observed scene structure by locally planar elements and formulate the
upsampling task as a global energy minimization problem. Our method determines
globally consistent solutions and preserves fine details and sharp depth
boundaries. In our experiments on several public datasets at different levels
of application, we demonstrate superior performance of our approach over the
state-of-the-art, even for very sparse measurements.
Comment: German Conference on Pattern Recognition 2016 (Oral)
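The role of a boundary-aware geodesic distance in the abstract above can be illustrated with a minimal sketch. It is not the authors' method: it replaces the locally planar model and the global energy minimization with a much simpler rule in which every pixel copies the depth of its geodesically nearest sparse measurement. The function name, the `boundary` map in [0, 1], and the parameters `step_cost` and `boundary_weight` are all assumptions made for illustration.

```python
import heapq


def geodesic_nearest_seed_depth(boundary, seeds, step_cost=1.0, boundary_weight=10.0):
    """Assign each pixel the depth of its geodesically nearest seed.

    boundary: 2D list of floats in [0, 1]; high values mark object boundaries
              (hypothetical stand-in for edge and semantic cues).
    seeds:    dict mapping (row, col) to a measured depth value.
    """
    rows, cols = len(boundary), len(boundary[0])
    dist = [[float("inf")] * cols for _ in range(rows)]
    depth = [[None] * cols for _ in range(rows)]
    heap = []
    for (r, c), d in seeds.items():
        dist[r][c] = 0.0
        depth[r][c] = d
        heapq.heappush(heap, (0.0, r, c, d))

    # Multi-source Dijkstra on the 4-connected pixel grid.
    while heap:
        cur, r, c, d = heapq.heappop(heap)
        if cur > dist[r][c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # Entering a boundary pixel is expensive, so paths avoid crossing edges.
                new = cur + step_cost + boundary_weight * boundary[nr][nc]
                if new < dist[nr][nc]:
                    dist[nr][nc] = new
                    depth[nr][nc] = d
                    heapq.heappush(heap, (new, nr, nc, d))
    return depth
```

With a boundary between two measurements, a pixel takes its depth from the seed on its own side of the boundary even when the seed across the boundary is closer in pixel distance, which is the boundary-preserving behaviour the abstract describes.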
Unsupervised Intuitive Physics from Visual Observations
While learning models of intuitive physics is an increasingly active area of
research, current approaches still fall short of natural intelligences in one
important regard: they require external supervision, such as explicit access to
physical states, at training and sometimes even at test time. Some authors
have relaxed such requirements by supplementing the model with a handcrafted
physical simulator. Still, the resulting methods are unable to automatically
learn new complex environments and to understand physical interactions within
them. In this work, we demonstrate for the first time that such predictors can be learned
directly from raw visual observations and without relying on simulators. We do
so in two steps: first, we learn to track mechanically-salient objects in
videos using causality and equivariance, two unsupervised learning principles
that do not require auto-encoding. Second, we demonstrate that the extracted
positions are sufficient to successfully train visual motion predictors that
can take the underlying environment into account. We validate our predictors on
synthetic datasets; then, we introduce a new dataset, ROLL4REAL, consisting of
real objects rolling on complex terrains (pool table, elliptical bowl, and
random height-field). We show that in all such cases it is possible to learn
reliable extrapolators of the object trajectories from raw videos alone,
without any form of external supervision and with no more prior knowledge than
the choice of a convolutional neural network architecture.
Associative3D: Volumetric Reconstruction from Sparse Views
This paper studies the problem of 3D volumetric reconstruction from two views
of a scene with an unknown camera. While seemingly easy for humans, this
problem poses many challenges for computers since it requires simultaneously
reconstructing objects in the two views while also figuring out their
relationship. We propose a new approach that estimates reconstructions,
distributions over the camera/object and camera/camera transformations, as well
as an inter-view object affinity matrix. This information is then jointly
reasoned over to produce the most likely explanation of the scene. We train and
test our approach on a dataset of indoor scenes, and rigorously evaluate the
merits of our joint reasoning approach. Our experiments show that it is able to
recover reasonable scenes from sparse views, while the problem is still
challenging. Project site: https://jasonqsy.github.io/Associative3D
Comment: ECCV 202
3D Fluid Flow Estimation with Integrated Particle Reconstruction
The standard approach to densely reconstruct the motion in a volume of fluid
is to inject high-contrast tracer particles and record their motion with
multiple high-speed cameras. Almost all existing work processes the acquired
multi-view video in two separate steps, utilizing either a pure Eulerian or
pure Lagrangian approach. Eulerian methods perform a voxel-based reconstruction
of particles per time step, followed by 3D motion estimation, with some form of
dense matching between the precomputed voxel grids from different time steps.
In this sequential procedure, the first step cannot use temporal consistency
considerations to support the reconstruction, while the second step has no
access to the original, high-resolution image data. Alternatively, Lagrangian
methods reconstruct an explicit, sparse set of particles and track the
individual particles over time. Physical constraints can only be incorporated
in a post-processing step when interpolating the particle tracks to a dense
motion field. We show, for the first time, how to jointly reconstruct both the
individual tracer particles and a dense 3D fluid motion field from the image
data, using an integrated energy minimization. Our hybrid Lagrangian/Eulerian
model reconstructs individual particles, and at the same time recovers a dense
3D motion field in the entire domain. Making particles explicit greatly reduces
the memory consumption and allows one to use the high-resolution input images for
matching, whereas the dense motion field makes it possible to include physical
a priori constraints and to account for the incompressibility and viscosity of the
fluid. The method exhibits greatly (~70%) improved results over our recently
published baseline with two separate steps for 3D reconstruction and motion
estimation. Our results with only two time steps are comparable to those of
state-of-the-art tracking-based methods that require much longer sequences.
Comment: To appear in International Journal of Computer Vision (IJCV)
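The incompressibility constraint mentioned in the abstract above corresponds to a divergence-free motion field. The following minimal sketch shows one way such a constraint could be expressed as a soft penalty on a dense grid. It is an illustration only, not the energy used in the paper: it works in 2D rather than 3D for brevity, uses central differences, and the function names and the grid spacing `h` are assumptions.

```python
def divergence_2d(u, v, h=1.0):
    """Central-difference divergence at the interior points of a regular grid.

    u[i][j], v[i][j]: x- and y-components of the flow at grid point
    (x = j * h, y = i * h). Returns a (rows - 2) x (cols - 2) nested list.
    """
    rows, cols = len(u), len(u[0])
    div = [[0.0] * (cols - 2) for _ in range(rows - 2)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            du_dx = (u[i][j + 1] - u[i][j - 1]) / (2 * h)
            dv_dy = (v[i + 1][j] - v[i - 1][j]) / (2 * h)
            div[i - 1][j - 1] = du_dx + dv_dy
    return div


def incompressibility_penalty(u, v, h=1.0):
    """Sum of squared divergence; zero for a discretely divergence-free field."""
    return sum(d * d for row in divergence_2d(u, v, h) for d in row)


n = 5
# Rigid rotation about the origin: (u, v) = (-y, x), which is divergence-free.
rot_u = [[-float(i) for j in range(n)] for i in range(n)]
rot_v = [[float(j) for j in range(n)] for i in range(n)]
# Uniform expansion: (u, v) = (x, y), which has divergence 2 everywhere.
exp_u = [[float(j) for j in range(n)] for i in range(n)]
exp_v = [[float(i) for j in range(n)] for i in range(n)]
```

Under these assumptions the rotation field incurs no penalty, while the expanding field is penalized at every interior grid point.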
Action recognition from weak alignment of body parts
We propose a method for human action recognition from still images that uses the silhouette and the upper body as a proxy for the pose of the person, and also to guide alignment between samples for the purpose of computing registered feature descriptors. Our contributions include an efficient algorithm, formulated as an energy minimization, for using the silhouette to align body parts between imaged human samples. The descriptors computed over the aligned body parts are incorporated, via a multiple kernel framework, together with other standard features (such as a deformable part model (DPM) and dense SIFT), to learn a classifier for each action class. Experiments on the challenging PASCAL VOC 2012 dataset show that our method exceeds the state-of-the-art performance on the majority of action classes.
Human pose estimation using a joint pixel-wise and part-wise formulation
Our goal is to detect humans and estimate their 2D pose in single images, in particular handling cases of partial visibility where some limbs may be occluded or one person is partially occluding another. Two standard, but disparate, approaches have developed in the field: the first is the part-based approach for layout-type problems, which involves optimising an articulated pictorial structure; the second is the pixel-based approach for image labelling, which involves optimising a random field graph defined on the image. Our novel contribution is a formulation for pose estimation which combines these two models in a principled way in one optimisation problem and thereby inherits the advantages of both of them. Inference on this joint model finds the set of instances of persons in an image, the location of their joints, and a pixel-wise body part labelling. We achieve state-of-the-art or near state-of-the-art results on standard human pose data sets, and demonstrate correct estimation for cases of self-occlusion, person overlap and image truncation.
Robust Higher Order Potentials for Enforcing Label Consistency
This paper proposes a novel framework for labelling problems which is able to combine multiple segmentations in a principled manner. Our method is based on higher order conditional random fields and uses potentials defined on sets of pixels (image segments) generated using unsupervised segmentation algorithms. These potentials enforce label consistency in image regions and can be seen as a generalization of the commonly used pairwise contrast-sensitive smoothness potentials. The higher order potential functions used in our framework take the form of the Robust P^n model and are more general than the P^n Potts model recently proposed by Kohli et al. We prove that the optimal swap and expansion moves for energy functions composed of these potentials can be computed by solving an st-mincut problem. This enables the use of powerful graph-cut-based move-making algorithms for performing inference in the framework. We test our method on the problem of multi-class object segmentation by augmenting the conventional CRF used for object segmentation with higher order potentials defined on image regions. Experiments on challenging data sets show that integration of higher order potentials quantitatively and qualitatively improves results, leading to much better definition of object boundaries.
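The difference between a strict label-consistency potential and a robust one can be made concrete with a small sketch. It assumes a simplified form in which a fully consistent segment costs nothing, the cost grows linearly with the number of pixels that disagree with the dominant label, and the cost is truncated at a maximum value. The abstract does not state the exact form, and the parameter names `q` (truncation point) and `gamma_max` (maximum cost) are assumptions.

```python
from collections import Counter


def pn_potts_potential(labels, gamma_max):
    """Strict consistency: any disagreement inside the segment costs gamma_max."""
    return 0.0 if len(set(labels)) == 1 else gamma_max


def robust_pn_potential(labels, q, gamma_max):
    """Robust consistency (simplified, assumed form).

    The cost rises linearly with the number of pixels that do not take the
    dominant label and is capped at gamma_max once that number reaches q.
    """
    n_dominant = Counter(labels).most_common(1)[0][1]
    n_disagree = len(labels) - n_dominant
    return min(n_disagree / q, 1.0) * gamma_max


segment = ["sky", "sky", "sky", "tree"]  # one pixel disagrees with the rest
strict_cost = pn_potts_potential(segment, gamma_max=1.0)
robust_cost = robust_pn_potential(segment, q=2, gamma_max=1.0)
```

In this sketch a single mislabelled pixel incurs the full cost under the strict potential but only half of it under the robust one, which is why the robust form tolerates segments that slightly overlap object boundaries.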